This dataset contains information on various attributes of homes sold in and around Sacramento, California. The data comes from real estate transactions that took place over a 5-day period. It was originally published on the website for the SpatialKey software and can be loaded using the R caret package (Kuhn 2020) or downloaded as a csv from here.
## [1] "The number of training examples is: 521"
## [1] "The number of predictors is: 8"
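As a minimal sketch of the loading step, the snippet below reads the `Sacramento` data frame from caret and counts the predictors. It assumes that `price` is the response and that every other column is a predictor. The helper name `load_sacramento` is hypothetical. Note that this loads the full dataset, not the 521-row training split used in this report, since the split procedure is not described here.

```r
# Load the Sacramento data from the caret package, if it is installed.
# Returns NULL when caret is unavailable so the script still runs.
load_sacramento <- function() {
  if (!requireNamespace("caret", quietly = TRUE)) {
    return(NULL)
  }
  env <- new.env()
  utils::data("Sacramento", package = "caret", envir = env)
  env$Sacramento
}

sacramento <- load_sacramento()

if (!is.null(sacramento)) {
  # Assumption: `price` is the response; all other columns are predictors.
  predictors <- setdiff(names(sacramento), "price")
  print(paste("The number of rows is:", nrow(sacramento)))
  print(paste("The number of predictors is:", length(predictors)))
}
```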
The graphic below shows that all 521 rows of the dataset are complete.
##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'
##     city zip beds baths sqft type latitude longitude  
## 521    1   1    1     1    1    1        1         1 0
##        0   0    0     0    0    0        0         0 0
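The same completeness check can be approximated in base R without the mice package. The sketch below uses a small hypothetical data frame (the values are made up for illustration and are not taken from the Sacramento data) and counts missing values per column and fully observed rows.

```r
# Hypothetical toy data; values are illustrative only.
homes <- data.frame(
  beds  = c(2, 3, 4, 3),
  baths = c(1, 2, 2, 2.5),
  sqft  = c(836, 1167, 1852, 1401)
)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(homes))

# Number of rows with no missing values in any column.
n_complete <- sum(complete.cases(homes))

# TRUE when every row is fully observed.
all_complete <- n_complete == nrow(homes)

print(missing_per_column)
print(all_complete)
```

A data frame is completely observed when every entry of `missing_per_column` is zero, which corresponds to the bottom row of zeros in the pattern table above.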
| Predictor | Type | Description |
|---|---|---|
| city | categorical | The city where the house is located. |
| zip | categorical | The zip code where the house is located. |
| beds | categorical | The number of bedrooms in the house. |
| baths | categorical | The number of bathrooms in the house. |
| type | categorical | The type of dwelling as either Residential, Condo or Multi-Family. |
| sqft | continuous | The size of the house measured in square feet. |
| latitude | continuous | The latitude geospatial location of the house. |
| longitude | continuous | The longitude geospatial location of the house. |
The most common number of bedrooms for a house is 3 with the second most common number being 4 bedrooms. Just under 100 homes of the 521 in the training set contain 2 bedrooms and very few homes contain 1, 5 or 6 bedrooms.
The majority of the houses (about two thirds) contain 2 bathrooms, and 1 bathroom is the second most common count. About 15% of homes have 3 bathrooms, and homes with 1.5, 2.5, 3.5, 4, 4.5 or 5 bathrooms are very rare.
Almost all of the homes were classified as Residential, with extremely few homes classified as Condo or Multi-Family.
Figure 1
| Smallest House Size | Largest House Size |
|---|---|
| 484 | 4878 |
House Size: The distribution of house size is right-skewed, with a few very large houses as outliers. The homes range in size from 484 sq ft on the low end to 4878 sq ft on the high end. The density peaks at roughly 1266 sq ft.
Figure 2
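One way to obtain the summary quoted for house size (smallest value, largest value and the location of the density peak) is sketched below. The data are simulated from a log-normal distribution purely for illustration, and the function name `summarise_size` is hypothetical. The peak is taken from a kernel density estimate with default settings, which may differ from the settings used for the figure.

```r
# Simulated right-skewed house sizes; not the Sacramento data.
set.seed(1)
toy_sqft <- rlnorm(200, meanlog = log(1300), sdlog = 0.4)

# Return the range of sizes and the x-value where the estimated density is highest.
summarise_size <- function(sqft) {
  dens <- density(sqft)
  list(
    smallest = min(sqft),
    largest  = max(sqft),
    peak     = dens$x[which.max(dens$y)]
  )
}

size_summary <- summarise_size(toy_sqft)
print(size_summary)
```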
Number of Bedrooms: As expected, the bulk of the house size distribution rises with the number of bedrooms, but there is variability within each bedroom level. Beyond one-bedroom homes, there is a wide range of sizes for homes with 2 or more bedrooms. However, as the number of bedrooms rises, the density becomes thinner and thinner because fewer and fewer homes have 4 or more bedrooms.
Figure 3
Number of Bathrooms: As with the number of bedrooms, house size increases with the number of bathrooms, but there is still considerable variability. Houses with 1.5 bathrooms have a very small range of house sizes. The density of the distributions becomes very thin as the number of bathrooms increases. Because house sizes vary considerably and overlap across bathroom counts, it is an open question whether the number of bathrooms will be a good predictor. Only at the extreme ends of bathroom count are the size distributions different enough that they do not overlap.
Figure 4
The largest correlations are between the number of bedrooms and square footage, and between the number of bathrooms and square footage. This is expected, but it may indicate that the numbers of bedrooms and bathrooms carry redundant information and that modelling with house size alone may give a better model. This will have to be assessed during the modelling process. The numbers of bedrooms and bathrooms also have a moderate positive correlation of 0.66 with each other. The strongest negative correlation is between the zip code and the longitude. The rest of the predictors are not highly correlated with each other.
Figure 5
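A pairwise correlation matrix like the one discussed above can be computed with base R's `cor()`. The sketch below uses a small hypothetical data frame of numeric columns (values made up for illustration) and then picks out the most strongly correlated pair. Categorical columns such as zip code would need a numeric coding first, and how that was done for the figure is not stated here.

```r
# Hypothetical toy data; values are illustrative only.
homes <- data.frame(
  beds  = c(2, 3, 3, 4, 4, 5),
  baths = c(1, 2, 2, 2, 3, 3),
  sqft  = c(850, 1200, 1350, 1800, 2100, 2900)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(homes)
print(round(cor_matrix, 2))

# Keep only the upper triangle so each pair is considered once.
off_diag <- cor_matrix
off_diag[!upper.tri(off_diag)] <- NA

# Locate the pair with the largest absolute correlation.
idx <- which(abs(off_diag) == max(abs(off_diag), na.rm = TRUE), arr.ind = TRUE)[1, ]
strongest <- c(rownames(cor_matrix)[idx[["row"]]], colnames(cor_matrix)[idx[["col"]]])
print(strongest)
```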
The following set of visuals describes the relationships between house price and attributes of the house such as size, number of bedrooms and number of bathrooms. These visuals should provide clues as to which variables will be the most useful in a model that attempts to infer house price.
House Price vs House Size: The scatterplot below shows a primarily linear relationship between house price and house size, with a few notable outliers. This suggests that a linear model may be appropriate here.
Figure 6
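To illustrate what a linear model of price on size would look like, the sketch below fits a simple linear regression with `lm()`. The data frame is hypothetical (values made up for illustration), and this is not necessarily the model that will be used later in the project.

```r
# Hypothetical toy data; values are illustrative only.
homes <- data.frame(
  sqft  = c(850, 1200, 1350, 1800, 2100, 2900),
  price = c(110000, 165000, 172000, 240000, 265000, 380000)
)

# Simple linear regression of price on house size.
fit <- lm(price ~ sqft, data = homes)

# Estimated change in price per additional square foot.
slope <- coef(fit)[["sqft"]]

# Proportion of the variance in price explained by size.
r_squared <- summary(fit)$r.squared

print(slope)
print(r_squared)
```

The slope can be read as the estimated change in price for each additional square foot, which is the kind of quantity the inferential questions in this report ask about.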
House Price vs Number of Bedrooms: As expected, the average house price rises as the number of bedrooms increases. However, within each level of bedrooms there can be quite a bit of variability, and the levels overlap. For example, the price for houses with 3 and 4 bedrooms varies between less than $60k on the low end and greater than $500k on the high end. Two- and five-bedroom homes also overlap with these distributions but have a smaller range. At the extreme ends of bedroom count there is separation in the distributions, but this variable may not strongly differentiate price for the middle levels (2, 3 and 4 bedrooms).
Figure 7
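The overlap between bedroom levels described above can be checked numerically by computing the price range within each level. The sketch below does this on a hypothetical data frame (values made up for illustration). The helper `ranges_overlap` is a hypothetical name introduced here.

```r
# Hypothetical toy data; values are illustrative only.
homes <- data.frame(
  beds  = c(2, 2, 3, 3, 3, 4, 4, 5),
  price = c(95000, 140000, 60000, 210000, 520000, 58000, 330000, 410000)
)

# Minimum and maximum price within each bedroom level (one row per level).
price_range <- t(sapply(split(homes$price, homes$beds), range))
colnames(price_range) <- c("min", "max")
print(price_range)

# Two ranges overlap when the larger minimum does not exceed the smaller maximum.
ranges_overlap <- function(a, b) {
  max(a[["min"]], b[["min"]]) <= min(a[["max"]], b[["max"]])
}

overlap_3_4 <- ranges_overlap(price_range["3", ], price_range["4", ])
print(overlap_3_4)
```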
House Price vs Number of Bathrooms: House price seems to vary less within a given number of bathrooms than within a given number of bedrooms. Houses with 2 bathrooms have the biggest range in price. Overall, price increases with the number of bathrooms. However, as with the number of bedrooms, some of the levels overlap, which suggests that this may not be the best variable for inferring price reliably.
Figure 8
House Price vs Housing Type: Overall, there is no trend between price and housing type. The peak price of each distribution is somewhat similar, give or take a few thousand dollars. The residential homes have the largest skewness in price. The lack of a trend with housing type indicates that it may not be a useful explanatory variable.
Figure 9
House Price vs City: The plot below shows the price distribution in the cities that contained records for 16 or more houses. On average, different cities have different house prices, but the averages do not seem to differ from each other substantially.
Figure 10
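Restricting a plot to cities with enough records, as done above with a threshold of 16 houses, can be sketched in base R as follows. The function name `keep_frequent_levels` and the toy data are hypothetical, and the example uses a threshold of 2 so that the small data frame shows the effect.

```r
# Keep only rows whose value in `column` occurs at least `min_count` times.
keep_frequent_levels <- function(data, column, min_count) {
  counts <- table(data[[column]])
  frequent <- names(counts)[counts >= min_count]
  data[data[[column]] %in% frequent, , drop = FALSE]
}

# Hypothetical toy data; city labels and prices are illustrative only.
homes <- data.frame(
  city  = c("A", "A", "A", "B", "B", "C"),
  price = c(150000, 180000, 210000, 300000, 320000, 250000)
)

frequent_homes <- keep_frequent_levels(homes, "city", min_count = 2)
print(table(frequent_homes$city))
```

For the threshold used in this report, the call would pass `min_count = 16` instead.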
House Price vs Geospatial Location: The 3D plot below explores house location and price. The 3D nature of the plot makes the spatial distribution of house price more apparent. There is clearly a spatial gradient in house price, with higher latitudes and longitudes associated with higher prices. The geospatial coordinates may therefore be useful in a linear model.
Figure 11
After conducting this visual exploration, there do not seem to be any problems warranting major concern. The dataset is complete and the variables have reasonable distributions (i.e. nothing is terribly skewed).
The research question for this project concerns the relationship between house price and the predictors. How will changing the values of the predictors influence the house price? Are there predictors that affect the house price more than others? This dataset provides some basic attributes about a house that should prove useful in inferring house price. Based on the visual analysis, the most influential predictors are likely square footage and the geospatial coordinates, and potentially the numbers of bedrooms and bathrooms.
The primary inferential research question for this project will be
Related questions will include:
Which of the predictors are most strongly associated with the response of house price?
If the number of bedrooms or bathrooms, or the house size, increases, how does this affect the house price?
The tools for this exploratory analysis include the R language (R Core Team 2019), and the following R packages: tidyverse (Wickham 2017), mice (van Buuren and Groothuis-Oudshoorn 2011), knitr (Xie 2014), here (Müller 2017), ggcorrplot (Kassambara 2019), patchwork (Pedersen 2019) and plotly (Sievert 2018).
This analysis uses the Sacramento dataset obtained through the R caret package (Kuhn 2020).
Kassambara, Alboukadel. 2019. Ggcorrplot: Visualization of a Correlation Matrix Using ’Ggplot2’. https://CRAN.R-project.org/package=ggcorrplot.
Kuhn, Max. 2020. Caret: Classification and Regression Training. https://CRAN.R-project.org/package=caret.
Müller, Kirill. 2017. Here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.
Pedersen, Thomas Lin. 2019. Patchwork: The Composer of Plots. https://CRAN.R-project.org/package=patchwork.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.
Sievert, Carson. 2018. Plotly for R. https://plotly-r.com.
van Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://www.jstatsoft.org/v45/i03/.
Wickham, Hadley. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.
Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.